The Interaction Loop
In the theater of Reinforcement Learning, the Agent and the Environment perform a continuous dance. At each discrete time step $t$, the agent receives a representation of the environment's State ($S_t$). Based on this, the agent selects an Action ($A_t$). One step later, as a consequence of its action, the agent receives a numerical Reward ($R_{t+1}$) and finds itself in a new state ($S_{t+1}$).
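This loop can be sketched in a few lines. The corridor environment below is invented for illustration (its states, actions, and rewards are not from the text); it only serves to show the $S_t \to A_t \to R_{t+1}, S_{t+1}$ cycle:

```python
import random

# Hypothetical toy environment: a 1-D corridor with states 0..4.
# State 4 is the goal; actions are -1 (step left) and +1 (step right).
def step(state, action):
    """One environment transition: returns (S_{t+1}, R_{t+1})."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def random_policy(state):
    """A placeholder agent that ignores the state entirely."""
    return random.choice([-1, +1])

# One turn of the dance: observe S_t, select A_t, receive R_{t+1} and S_{t+1}.
s = 0
a = random_policy(s)
s_next, r = step(s, a)
```

Note the off-by-one convention: the reward produced by action $A_t$ is indexed $R_{t+1}$, emphasizing that it arrives together with the next state.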
The Finite MDP Framework
A Finite Markov Decision Process (Finite MDP) is the mathematical bedrock of this interaction. It assumes that the sets of states, actions, and rewards are finite. This allows us to define the dynamics of the environment through a single probability distribution: $p(s', r | s, a) = \Pr\{S_t=s', R_t=r | S_{t-1}=s, A_{t-1}=a\}$.
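Because everything is finite, the dynamics function is just a lookup table. The sketch below uses an invented two-state MDP (states, actions, and rewards are illustrative assumptions, not from the text) to show two consequences of the definition: the probabilities must sum to 1 for each $(s, a)$, and the state-transition probabilities $p(s' \mid s, a)$ follow by marginalizing out the reward:

```python
# Four-argument dynamics p(s', r | s, a) of a tiny, invented finite MDP,
# stored as a table: (s, a) -> {(s', r): probability}.
p = {
    ('s0', 'a0'): {('s0', 0.0): 0.5, ('s1', 1.0): 0.5},
    ('s0', 'a1'): {('s1', 0.0): 1.0},
    ('s1', 'a0'): {('s0', 0.0): 1.0},
    ('s1', 'a1'): {('s1', 1.0): 1.0},
}

# p is a probability distribution over (s', r) for every state-action pair.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
```

Other quantities, such as the expected reward $r(s, a)$, can be derived from the same table in the same way.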
- Agent-Environment Boundary: This is not a physical shell but a functional one. Anything the agent cannot change arbitrarily is considered part of the environment. A robot's motors and sensors belong to the environment; the control software that selects actions is the agent.
- Episodic Tasks: The interaction breaks into finite sequences called Episodes, ending in a Terminal State (e.g., a game over).
- Continuing Tasks: Interactions that go on forever without a natural end (e.g., process control).
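The episodic case can be sketched as a loop that runs until a terminal state is reached. The corridor environment here is again an invented example (states 0 through 4, with state 4 terminal), not something defined in the text:

```python
TERMINAL = 4  # assumed terminal state of a hypothetical 1-D corridor

def step(state, action):
    """One transition: actions are -1 (left) and +1 (right)."""
    next_state = max(0, min(TERMINAL, state + action))
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward

def run_episode(policy, start=0, max_steps=1000):
    """Collect rewards R_1, R_2, ... until the terminal state ends the episode."""
    state, rewards = start, []
    for _ in range(max_steps):
        state, reward = step(state, policy(state))
        rewards.append(reward)
        if state == TERMINAL:
            break
    return rewards

# A deterministic "always go right" policy terminates after four steps.
rewards = run_episode(lambda s: +1)
```

A continuing task has no such break condition; in that setting the loop runs indefinitely, which is why later machinery (e.g., discounting) is needed to keep cumulative reward well defined.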